perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids by valentijnscholten · Pull Request #15046 · DefectDojo/django-DefectDojo

valentijnscholten · 2026-06-19T20:55:27Z

What

Performance optimizations for the hash-recompute / dedupe write path in mass_model_updater and Finding.get_vulnerability_ids.

The dedupe recalculation command performance relies heavily on bulk_update. Turns out this is performing a very expensive part of code to calculate a bit CASE WHEN .... clause. Because this is a hot path and the dedupe only updates 1 (or 3) hash code fields, I feel it's OK to use raw SQL here as it saves huge amounts of time.

`mass_model_updater` (`dojo/utils.py`)

skip-unchanged (skip_unchanged=True default): rows whose tracked fields weren't changed by the update function are no longer written. Re-running over already-correct data now issues zero UPDATEs. The old value is read from the page query (via __dict__, so deferred fields don't trigger an extra query).
writer hook: optional writer(model_type, batch, fields) callable to replace bulk_update for persisting a batch; defaults to bulk_update.

VALUES fast-write (`dojo/finding/deduplication.py`)

hashcode_values_writer writes the text hash columns with a single UPDATE … FROM (VALUES …) join instead of bulk_update's per-row CASE/WHEN expression tree (which is O(rows×fields) Python nodes to build/resolve and dominates wide-batch writes). PostgreSQL only; falls back to bulk_update on other backends. Values are bound as parameters.

Setting: `DD_MASS_HASH_CODE_USE_SQL_WRITER` (`dojo/settings/settings.dist.py`)

New env-backed setting (MASS_HASH_CODE_USE_SQL_WRITER, default True). When True, the dedupe hash_code pass uses the PostgreSQL VALUES-join fast writer; set False to always fall back to bulk_update (escape hatch if the SQL writer misbehaves on a given deployment). The default writer was also extracted into default_mass_model_writer so the bulk_update path is a named callable.

vulnerability_ids N+1 (`dojo/models.py`)

get_vulnerability_ids now reads finding.vulnerability_id_set.all() (prefetch-aware) instead of a fresh Vulnerability_Id.objects.filter(...) + .count(). For parsers whose hash config includes vulnerability_ids this was a per-finding COUNT + SELECT N+1.

dedupe command (`dojo/management/commands/dedupe.py`)

prefetch vulnerability_id_set; use the VALUES writer for the hash pass.

Why

The dedupe/recompute write path scales poorly on large finding sets: bulk_update builds a per-row CASE/WHEN for the updated columns (slow to build/resolve for wide batches), it writes every row even when the value is unchanged, and get_vulnerability_ids issued a query per finding for affected parsers. These changes remove all three: unchanged rows aren't written, changed rows use a VALUES join, and the vuln-id lookup honors prefetch.

Concretely, on a ~15k-finding all-changed batch the VALUES join avoids ~14s of Python spent building and resolving bulk_update's CASE/WHEN expression tree (the resolve_expression cost in profiling). That figure is purely the write mechanism — independent of any caching. The skip-unchanged and prefetch wins on top are workload-dependent (driven by how many rows actually changed and whether the parser hashes on vulnerability_ids).

Tests

unittests/test_mass_model_updater.py — skip-unchanged, writer hook, VALUES writer (incl NULL handling), fields=None side-effect mode. All green on PostgreSQL.

Compatibility

Backward-compatible: skip_unchanged only skips no-op writes; writer defaults to bulk_update; the VALUES path is Postgres-gated with a bulk_update fallback, and is additionally gated behind DD_MASS_HASH_CODE_USE_SQL_WRITER (default True) so it can be disabled per deployment without a code change.

…lnerability_ids mass_model_updater optimizations for the hash-recompute / dedupe write path: - skip_unchanged (default True): rows whose tracked fields were not changed by the update function are no longer written. Re-runs over already-correct data issue zero UPDATEs. Compared against values loaded by the page query (read from __dict__ so deferred fields don't trigger an extra query). - writer hook: optional `writer(model_type, batch, fields)` callable to replace Django's bulk_update for a batch. Defaults to bulk_update. - hashcode_values_writer (dojo/finding/deduplication.py): writes the text hash columns with a single `UPDATE ... FROM (VALUES ...)` join instead of bulk_update's per-row CASE/WHEN expression tree, which dominates wide-batch updates. PostgreSQL only; falls back to bulk_update otherwise. Values are bound as parameters. - Finding.get_vulnerability_ids now reads the prefetch-aware reverse relation (finding.vulnerability_id_set.all()) instead of a fresh Vulnerability_Id.objects.filter(...) + .count(), so prefetch_related is honored (was a per-finding N+1 for parsers whose hash config includes vulnerability_ids). - dedupe command: prefetch vulnerability_id_set and use the VALUES writer for the hash pass. Adds unittests/test_mass_model_updater.py covering skip-unchanged, the writer hook, the VALUES writer (incl NULL handling) and fields=None side-effect mode.

The new TestMassModelUpdater hardcoded fixtures = ["dojo_testdata.json"], which fails under V3_FEATURE_LOCATIONS=True: loading the legacy Endpoint rows trips the model's V3 deprecation guard during tagulous retag. Decorate with @versioned_fixtures (the established repo pattern) so it loads dojo_testdata_locations.json under V3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mtesauro

Approved

…ting - Extract bulk_update into default_mass_model_writer so the default mass_model_updater write path is a named callable. - Add DD_MASS_HASH_CODE_USE_SQL_WRITER (default True). When False, the dedupe hash_code pass falls back to bulk_update instead of the PostgreSQL VALUES-join writer.

github-actions Bot added the unittests label Jun 19, 2026

valentijnscholten modified the milestones: 3.0.100, 3.1.0 Jun 19, 2026

valentijnscholten marked this pull request as ready for review June 21, 2026 06:49

valentijnscholten requested review from Maffooch and mtesauro as code owners June 21, 2026 06:49

valentijnscholten added the affects_pro PRs that affect Pro and need a coordinated release/merge moment. label Jun 21, 2026

mtesauro approved these changes Jun 21, 2026

View reviewed changes

github-actions Bot added the settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR label Jun 21, 2026

Maffooch approved these changes Jun 23, 2026

View reviewed changes

Maffooch requested review from blakeaowens and dogboat June 23, 2026 20:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids#15046

perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids#15046
valentijnscholten wants to merge 3 commits into
DefectDojo:devfrom
valentijnscholten:perf/dedupe-skip-unchanged-and-values-write

valentijnscholten commented Jun 19, 2026 •

edited

Loading

Uh oh!

mtesauro left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

valentijnscholten commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

mass_model_updater (dojo/utils.py)

VALUES fast-write (dojo/finding/deduplication.py)

Setting: DD_MASS_HASH_CODE_USE_SQL_WRITER (dojo/settings/settings.dist.py)

vulnerability_ids N+1 (dojo/models.py)

dedupe command (dojo/management/commands/dedupe.py)

Why

Tests

Compatibility

Uh oh!

mtesauro left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

valentijnscholten commented Jun 19, 2026 •

edited

Loading

`mass_model_updater` (`dojo/utils.py`)

VALUES fast-write (`dojo/finding/deduplication.py`)

Setting: `DD_MASS_HASH_CODE_USE_SQL_WRITER` (`dojo/settings/settings.dist.py`)

vulnerability_ids N+1 (`dojo/models.py`)

dedupe command (`dojo/management/commands/dedupe.py`)